Re-evaluating Open-Ended Evaluation of
Large Language Models

A case study using the livebench.ai dataset.

Siqi Liu
University College London, UK
Google DeepMind, UK
Ian Gemp
Google DeepMind, UK
Luke Marris
Google DeepMind, UK
Georgios Piliouras
Google DeepMind, UK
Nicolas Heess
Google DeepMind, UK
Marc Lanctot
Google DeepMind, UK

† We used the livebench dataset released on Aug 18, 2024.
‡ Equal contributions.

Introduction

For an evaluation to be trusted, the underlying data must be of high quality and representative of downstream applications. Unfortunately, this condition has become difficult to satisfy, as modern LLMs are increasingly tested in an open-ended setting, with test data (e.g. prompts) crowdsourced from the public. The resulting test set can be biased and redundant, calling into question any ratings that report performance on average.

The question then is what test distribution should be used for evaluation, if not the average. In this work, we answer this question following a game-theoretic recipe, derived from an explicit evaluation objective. For instance, we might wish to rank models based on how they perform on prompts that discriminate among strong models. This objective might appear circular at first, but it is precisely what lets us transform evaluation data into a game with utility-maximising players, in which players adjust their strategies simultaneously until convergence. At convergence, task ratings and model ratings can be computed from the resulting equilibrium distributions: by definition, no player has an incentive to deviate from sampling according to them.

To make this idea concrete, let's work through a case study using the livebench dataset, observing which tasks are the most discriminative among strong models and which models emerge as competitive against the task equilibrium.

Case Study: livebench.ai

Task-level analysis

In this section we analyze equilibrium ratings at the task level, where a model's per-task score is its average score over all prompts in that task.
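As a minimal sketch of this aggregation (assuming a hypothetical long-format table with columns task, model and score, one row per prompt-model pair; not the actual livebench schema), the per-task score table can be computed as:

import pandas as pd

# Hypothetical prompt-level results: one row per (prompt, model) pair.
df = pd.DataFrame({
    "task": ["summarize", "summarize", "code_completion", "code_completion"],
    "model": ["model_a", "model_b", "model_a", "model_b"],
    "score": [0.8, 0.6, 0.4, 0.7],
})

# Per-task score: a model's average score over all prompts in the task.
score_table = df.groupby(["task", "model"])["score"].mean().unstack("model")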

Evaluation-as-a-game: game = gamification(data).

The intuition behind game-theoretic equilibrium ratings is that we can do better than rating models (tasks) against the average task (model). Instead, we transform evaluation data (i.e. the livebench.ai score table shown below) into a game such that equilibria of this game reflect our evaluation goals.

There are many ways to define utility functions. Here, we define utility functions to reflect the goal of computing model (task) ratings that reflect performance on discriminative tasks (competitive models).

Let $r(a_t, a_m)$ be the scalar score of a model $a_m$ on a task $a_t$. We define the utility to the model player that plays model $a_m$ against an opponent model player that plays $a'_m$ to be $u_m(a_t, a_m, a'_m) = r(a_t, a_m) - r(a_t, a'_m)$, and the utility to the task player to be $u_t(a_t, a_m, a'_m) = |u_m(a_t, a_m, a'_m)|$. In other words, the model player wants to outperform competing models on the task played by the task player, and the task player wants to separate the pair of models played by the two model players.
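As a minimal sketch of this construction (assuming a NumPy array scores of shape [num_tasks, num_models] holding $r(a_t, a_m)$; the numbers below are made up purely for illustration), the two utility tensors can be built as:

import numpy as np

# Hypothetical score table: scores[t, m] = r(a_t, a_m),
# the score of model m on task t.
scores = np.array([
    [0.8, 0.6, 0.7],
    [0.4, 0.9, 0.5],
])  # 2 tasks, 3 models, made up for illustration

# u_model[t, m, m'] = r(a_t, a_m) - r(a_t, a'_m):
# the first model player wants to outperform the second on task t.
u_model = scores[:, :, None] - scores[:, None, :]

# u_task[t, m, m'] = |u_model[t, m, m']|:
# the task player wants to separate the two models it is comparing.
u_task = np.abs(u_model)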

Solving the game: equilibria, ratings = solver(game).

Given the above definition of a game, we can now solve for equilibrium solutions of this 3-player game. These equilibria correspond to either a distribution over actions for each player (in the case of Nash equilibria, NE), or a distribution over joint actions for all players (in the case of Correlated Equilibria, CE). Loosely speaking, equilibria are fixed points of the game such that no player can profitably deviate from their current equilibrium strategy (e.g. a distribution over its actions).

We consider two equilibrium solution concepts in this analysis. These solutions differ in terms of what information is observed by each player, which affects which actions should receive high ratings. Here we enumerate these solutions, together with an intuitive description of which actions we expect to receive high ratings:

Quantal-Response Equilibrium (QRE, aka NE): actions that are strong generalists tend to be rated highly;
Correlated Equilibrium (CE): actions that are strong generalists or generally capable specialists tend to be rated highly.

For all solution concepts, we initialise the solvers at the marginal distributions that maximise the affinity entropy, which is invariant to exactly redundant actions: introducing copies of an action should have no effect on the equilibrium the solvers settle on, as the copies split their marginal probability mass at initialisation.
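To give a rough sense of what a solver does, here is a toy damped logit-response iteration on the game sketched above (reusing scores and u_task, initialised uniformly rather than at the affinity-entropy point, and treating the two model players symmetrically). This is only a sketch under those assumptions, not the solver used in this work:

# Toy damped logit-response iteration towards a quantal-response-like
# fixed point; a sketch only, not the actual solver.
num_tasks, num_models = scores.shape
p_task = np.full(num_tasks, 1.0 / num_tasks)     # task player's marginal
p_model = np.full(num_models, 1.0 / num_models)  # shared marginal for both model players
temperature, damping = 0.1, 0.5

def softmax(x):
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

for _ in range(1000):
    # Expected payoff of each model action against the opponent's marginal,
    # averaged over the task player's marginal.
    v = p_task @ scores             # v[m]: expected score of model m over tasks
    payoff_model = v - v @ p_model  # subtract the opponent's expected score
    # Expected payoff of each task action: how well it separates two
    # models drawn independently from the model marginal.
    payoff_task = np.einsum("tmn,m,n->t", u_task, p_model, p_model)
    # Damped logit responses keep the iterates on the probability simplex.
    p_model = (1 - damping) * p_model + damping * softmax(payoff_model)
    p_task = (1 - damping) * p_task + damping * softmax(payoff_task)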

For each equilibrium solution of the game, we show each player's (marginal) equilibrium distribution over their actions (bottom). Each action $a$ is then rated according to the regret its player incurs for playing their equilibrium strategy rather than $a$. We report the model and task ratings at the top.
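Concretely, under the assumptions of the toy sketch above, an action's rating can be read off as its deviation gain against the equilibrium marginals. Again this only illustrates the idea, not the exact rating computation used here:

# Rating of an action = expected payoff of deviating to that action
# minus the expected payoff of playing the equilibrium marginal.
v = p_task @ scores
payoff_model = v - v @ p_model
payoff_task = np.einsum("tmn,m,n->t", u_task, p_model, p_model)

model_ratings = payoff_model - payoff_model @ p_model
task_ratings = payoff_task - payoff_task @ p_task

# Rank models and tasks by their equilibrium ratings.
model_ranking = np.argsort(-model_ratings)
task_ranking = np.argsort(-task_ratings)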

Quantal-Response Equilibrium Ratings and Marginals (QRE, aka NE)
Max-Entropy Correlated Equilibrium Ratings and Marginals (CE)

The high-level take-away is that claude-3-5-sonnet-20240620 takes the top spot on this dataset according to all equilibrium concepts, which appears consistent with the external livebench ranking.

Among the evaluation tasks, several coding, reasoning and specific language tasks emerge as strong contenders. In fact, we can show this more clearly by leveraging the equilibrium structure. We do so via marginal rating contribution analysis in the next section.

Marginal rating contribution analysis

How should we interpret equilibrium ratings? One way is to break each rating down into its contributors, i.e. the tasks that each contribute positively or negatively to the rating. Here, we see that models with high ratings occupy different strategic niches: they each perform well on a different set of discriminative tasks.

You can click on individual bars to highlight contributions from tasks of the same category. Hold shift to select multiple categories. Click and drag vertically on the left-hand rating chart to select a subset of models to focus on. Double-click anywhere to undo the selection.
Quantal-Response Equilibrium Marginal Contributions
Max-Entropy Correlated Equilibrium Marginal Contributions

We now break down each model rating (red diamond) in terms of the contribution from each task: summing the bars on each row recovers the equilibrium rating of the model. Let's focus on the top-ranked models (by selecting an interval on the left-hand rating chart).

We observe that the top-ranked claude-3-5-sonnet-20240620 is strong on all task categories, with instruction-following its only relative weakness. Its strongest competitive advantage comes from coding, with plot_unscrambling, LCB_generation and code_completion the tasks where it holds the most significant competitive advantages (you can filter tasks by category by clicking on the coding category). The rating breakdown of gemini-1.5-pro-exp-0801 looks entirely different: it derives all of its positive rating from instruction-following, and notably from the task summarize.

We observe that gemini-1.5-pro-exp-0801 ranks higher under CE ratings than under the NE. This is because it is a strong specialist on the task summarize, but is generally less competitive.

Prompt-level analysis

While we have focused on task-level analysis so far, our equilibrium-solving procedure can scale up to O(10k) tasks with O(10) models. This means we can carry out the same analysis at the level of individual prompts.

Considering each prompt to be an individual task, we can turn prompt-level evaluation data into a prompt-vs-model-vs-model game as before, except that the prompt player chooses individual prompts rather than collections of prompts (i.e. tasks). This lets us identify the individual prompts that are the most comparatively adversarial to each model.

Citation

If you find this work useful, please consider citing it.

@inproceedings{
	liu2025reevaluating,
	title={Re-evaluating Open-ended Evaluation of Large Language Models},
	author={Siqi Liu and Ian Gemp and Luke Marris and Georgios Piliouras and Nicolas Heess and Marc Lanctot},
	booktitle={The Thirteenth International Conference on Learning Representations},
	year={2025},
	url={https://openreview.net/forum?id=kbOAIXKWgx}
}